Day19-爬蟲實戰2

2019鐵人賽 day19

wayneli

2018-11-02 21:08:44

2817 瀏覽

分享至

上篇文章所開發的爬蟲還有許多不完善的地方，今天我們來針對這些問題進行修正吧！

在開發程式上總是不會一步到位，我們能先將不足的地方條列出來並一一完善他，在這個過程中就能夠暸解自己學習上還不深刻的地方！

Bug

設定抓取Class前20個，當第一頁不滿20筆時會聯同下方的公告一起抓取
輸入類別時不會抓取到Re:的文章

設定抓取Class前20個，當第一頁不滿20筆時會聯同下方的公告一起抓取

修改程式碼如下：

from bs4 import BeautifulSoup
import requests


uri = 'https://www.ptt.cc/bbs/Soft_Job/index.html'
html = requests.get(uri)
soup = BeautifulSoup(html.content, 'html.parser')

content = soup.find('div', class_='r-list-container action-bar-margin bbs-screen')
r_list_sep = content.find('div', class_ = 'r-list-sep')

if r_list_sep == None:
    r_ent_div = content.find_all('div', class_ = 'r-ent') #找出指定的 class 
else:
    r_ent_div = r_list_sep.find_previous_siblings('div', class_ = 'r-ent')
    
category = '[' + input('請輸入要搜尋的類別：') + ']'
i = 0

for item in r_ent_div:
    title = item.find( class_ = 'title')
    if title.find('a'): #過濾掉被刪除的文章
        s = title.find('a')
        title_text = s.string
        date = item.find('div', class_ = 'date')
        if title_text.startswith(category) :
            i = i+1
            print('#{}標題: {} 發文日期： {} \n #連結：https://www.ptt.cc{}'.format(i, title_text, date.string, s.get('href')))

輸入類別時不會抓取到Re:、Fw:的文章

修改判斷字串開頭這段


#if title_text.startswith(category) :
# to
if category in title_text :

新功能

針對連線狀況做判斷
抓取多頁資料

針對連線狀況做判斷

加上連線判斷於開始撈取前，如下：

uri = 'https://www.ptt.cc/bbs/Soft_Job/index.html'
html = requests.get(uri)

if html.status_code != requests.codes.ok:
    print('無法連線網站')
else:    
    soup = BeautifulSoup(html.content, 'html.parser')
    ....

抓取多頁資料

抓取多頁為了將程式可讀性提高，將兩個部分寫成自訂函數

from bs4 import BeautifulSoup
import requests

uri = 'https://www.ptt.cc/bbs/Soft_Job/index.html'

Ptt_Domain = 'https://www.ptt.cc'
lastPage = ''
category = ''
  
#判斷連線狀態
def connection_Check(html):
    if html.status_code != requests.codes.ok:
        return False
    else:
        return True
#撈取網頁資料並列印
def Get_Page_Data(uri) : 
    html = requests.get(uri)
    if connection_Check(html):
        
        soup = BeautifulSoup(html.content, 'html.parser')
        
        menuDiv = soup.find('div', class_ = 'btn-group btn-group-paging')
        lastLinks = menuDiv.find_all('a')
        for lastLink in lastLinks:
            if '上頁' in lastLink.string:
                global lastPage 
                lastPage = lastLink.get('href')
            
        content = soup.find('div', class_='r-list-container action-bar-margin bbs-screen')
        r_list_sep = content.find('div', class_ = 'r-list-sep')
        
        if r_list_sep == None:
            r_ent_div = content.find_all('div', class_ = 'r-ent') #找出指定的 class 
        else:
            r_ent_div = r_list_sep.find_previous_siblings('div', class_ = 'r-ent')
            
        
        i = 0
        for item in r_ent_div:
            title = item.find( class_ = 'title')
            if title.find('a'): #過濾掉被刪除的文章
                s = title.find('a')
                title_text = s.string
                date = item.find('div', class_ = 'date')
                global category
                if category in title_text :
                    i = i+1
                    print('#{}標題: {} 發文日期： {} \n #連結：https://www.ptt.cc{}'.format(i, title_text, date.string, s.get('href')))
                          
        return 
    else:
        print('無法連線網站')
        return 
    
while True:       
    try:
        page = int(input('請輸入要搜尋頁數：')) 
        break
    except:
        print('請輸入數字！')

SearchRange = range(1, page + 1) 
category = '[' + input('請輸入要搜尋的類別：') + ']'

for num in SearchRange:
    print('第{}頁'.format(num))
    if num == 1:
        Get_Page_Data(uri)
    else:
        uri = Ptt_Domain + lastPage
        Get_Page_Data(uri)

執行結果如下：